PROSPER LOAN DATA by JULIA KOLLOFRATH

Prepare the workspace

In this work the loan data from Prosper (https://www.prosper.com/home/) will be analyzed using R. The dataset was provided by Udacity (https://www.google.com/url?q=https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv&sa=D&ust=1526295908174000) and can also be found on Kaggle (https://www.kaggle.com/jschnessl/prosperloans).

Prosper is a peer-to-peer lending marketplace (https://en.wikipedia.org/wiki/Prosper_Marketplace) providing personal loans for borrowers and funds from 2000 - 35000 dollars to individual or institutional investors. The data set contains details to 113937 entries including scheduled monthly loan payment, borrowers income, occupation, employment status, former loans and debt to income ratio. In total, the dataset contains 81 variables.

Univariate Plots Section

The loans

First, let’s get an overview of the loans provided by Prosper.

The histogram of loan amounts reveals a distribution skewed to lower loans and ranging from 1000 to 35000 dollars (this is in accordance with the platforms conditions). The median of the loan amounts is 6500 dollars, but the high numbers of loans at 10000, 15000, 20000 and 25000 dollars are remarkable: customers seem to prefer to take a loan on a “round” amount of money. This feature can be observed better when lowering the binwidth to 250 dollars: integer thousands loan amounts have far higher counts, especially below 10000 dollars.

The distribution of scheduled monthly loan payments range from 0 to 2252 dollars with a median at 218 dollars. Above 750 $ monthly payments get quite rare.

Term expresses the length of the loan in month. It seems that the platform only offers terms of 12, 36 and 60 month. The majority of borrowers decide for a 36 month term. The existance of the 12 month term is enigmatic as on the website today only 36 and 60 month terms are available, but before 2013 a 12 month term was available (according to https://www.lendacademy.com/36-60-month-loans-lending-club-and-prosper/).

Let’s plot histograms of monthly payments by term:

While loans with a term of 60 month have a mean amount of 12370 dollars, those with a term of 36 month have a mean amount of 7276 dollars. Term of 12 months were chosen for even lower loan amounts with a mean of 4694 dollars. This could be expected as a higher amount of loan will take longer to pay back. So how do the monthly payments differ for the differnt term options?

These plots follow the same logic as the plots before. This is getting too boring, so let’s explore something different.

Let’s explore what the loans are taken for using the variable listing category.

By far the most loans are taken for Debt Consolidation. The levels “Not Available” and “Other” are not very helpful here for exploring. Home Improvement and Business are other common reasons for a loan.

60 months loans are almost only taken on for Debt Consolidation and also a bit for Home Improvement and Business (and Other), for other reasons credits with this long term were not chosen. Maybe that is due to a higher loan taken for these categories?

Let’s check on the mean loan amounts taken for each category:

We can see that the categories we have seen before in the 60-month-term facet are among the categories with the highest mean loan amounts, besides Baby & Adoption, Boat, Engagement Ring and Wedding. Interestingly, the category Other no high mean loan amount, but people seem to take 60 month to pay this loan back. However, there are quite a few outliers in this category towards higher amounts, so this could be the credits we see in the 60-month-term facet above. We will check on loan amounts vs term or in which category the most defaults will happen later but for now we will go on with the question:

How did Prosper evolve since its start? How many loans are granted each Quarter?

Huh, it seems Prosper went into some issues in Q4 2008 and retreated in Q1 and Q2 2009. This coincides with the start of the international banking crisis in September 2008. A quick look into the wikipedia page of Prosper reveals that the SEC imposed a cease and desist order on Prosper in November 2008 and after clarification in July 2009 Prosper relaunched.

Let’s have a quick check when 12-months terms were available, as we read about them being discontinued before.

In this plot it is obvious that before Q1 2011 only 36-month terms were offerd and then expanded to 12- and 60-months terms. The 12-month term was stopped after Q2 2013.

Let’s check for the evolution of Listing Categories.

What can be seen here, is that all Listing categorys were rolled out in Q4 2007. The category Personal Loan existed only for three month and was stopped after Prospers relaunch in 2009. Some more categories like cosmetic procedure or engagement ring was rolled out after Q1 2012. If we run into calculating percentages of loan categories lateron we should keep these informations in mind.

Let’s have a look at how many loans get paid back and how many not.

There seem to be too many categories to answer this question quickly. We have to make some rearrangments to the categories above. We can that a ‘final payment in progress’ is close to a completed loand and include those into the category ‘completed’. We can give borrowers a 14 days’ grace and include ‘PastDue (1-15 days)’ into the category ‘current’. ‘charged off’, ‘cancelled’ and all the other ‘PastDue (…)’ will be included to ‘default’ loans. Then we can calculate the frequencies of the three

## 
## defaulted completed   current 
## 0.1604044 0.3359664 0.5036292

Now we see that 50% of the loans are still running and 34% got paid back and 16% defaulted. In my opinion, that is quite a high percentage failing to pay back.

The Borrowers

After examining the loans granted by Prosper, let’s now have a closer look at the borrowers.

The histogram of monthly incomes is skewed towards lower incomes, meaning more people with lower incomes tend to take a loan, which makes sense I guess. Most Borrowers to date only hold one Prosper loan and are employed or work full-time. Only a small fraction of the Borrowers have different employment statuses. About 50% of the Borrower do own a home.

Let’s have a look at the occupations of the Borrowers.

Most Borrowers answered “other” which is not very helpful for us in that moment. Also the Occupation description ‘Professional’ was used often but does not lead to deeper insights. What is interesting that Computer Programmers have the higherst count here, meaning of all occupations computer programmers have the most loans. This might be strange at first, as you would expect that computer programmers get paid fairly well, so what will they need so many loans for? Are they living like a lord? I think the reason for this is that Prosper is a online marketplace for loans and who will tend to get a loan online rather than in a bank branch? Someone who spends a lot of time in front of a computer and who has a high trust in tech companies, thus computer programmers might use Prosper more often than people with other occupations.

Univariate Analysis

What is the structure of your dataset?

The data set contains 113937 rows of 81 variables.

What is/are the main feature(s) of interest in your dataset?

The variable of main interest are LoanOriginalAmount, MonthlyLoanPayment, Occupation, DebtIncomeRatio, ProsperRating, LoanStatus and ListingCategory.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

EmploymentStatus, IsBorrowerHomeowner, EmploymentDuration and LoanFirstDefaultedCycleNumber could bring more insights.

Did you create any new variables from existing variables in the dataset?

I compressed the LoanStatus to only three categories as this variable seemed to contain far too many details for a quick glimpse on the data.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I reordered levels on LoanOriginationQuarter and LoanStatus to make cleaner plots.

Bivariate Plots Section

The Prosper loan data can be used for an little insight into “Who earns What?”: A boxplot of the stated monthly income sorted by occupation reveals the medians and spreads of salaries in the different occupations.

Students have the lowest income, naturally. The slope of the median incomes seems to be somewhat expected so no big surprises here. There are quite a few outliers stating they have no monthly income. Are these retired people?

We should keep in mind that the sample drawn from the population by using the Prosper loan data sets might not be representative and there might be a tendency of underestimating the income for each group as only the not so well earning individuals of a specific group tend to take a loan and also some borrowers stated to work part time. So let’s replot with a separation into full-time and part-time workers.

Here we can see the influence of part-time work on the monthly income per occupation. For students and lower incomes there is not so much of a difference in income whether you work part time or full time but with increasing income the gap widens before for higher income the part-time work incomes start to catch up to the full-time incomes. Unfortunately, it is not quantified in the dataset by how much working hours are reduced when calling it part-time work.

## `geom_smooth()` using method = 'gam'

Let’s leave this exploration of incomes vs occupation at this point and get back to the loaning topic. How much do Borrowers tend to loan based on their income?

## `geom_smooth()` using method = 'gam'

The first thing popping to the eye is that you seem to have at least a monthly income of 7500 dollars to get granted a loan above 250000 dollars by Prosper. When trying to smooth the data points, there is a hinge at very low income, which is due to the very high amount on incomes stated to be zero (why do those people get granted a loan by Prosper?), as well as the loan amount is starting to decrease again at higher incomes. This could be due to the fact that the more income you have the less you would have to loan. but we should also keep in mind that data gets sparser and sparser at incomes above 20000 dollars.

Let’s check if these incomes are verified.

## `geom_smooth()` using method = 'gam'

Well, that’s a pretty obvious picture that almost all incomes of 0 dollars are not verified. Let’s filter our dataset to only verified incomes.

## `geom_smooth()` using method = 'gam'

This removed the hinge at very low incomes a bit. But still, I wonder here: Why would someone with a 40000 USD monthly income ask for a loan of 3000 USD? Are these some people trying out how Prosper works before engaging as Investor?

## `geom_smooth()` using method = 'gam'

Here, as expected, the number of investors seems to increase with the amount of loan. But even up until loans of 35000 dollars there are loans which are financed by just one investor.

cor.test(loan$LoanOriginalAmount, loan$MonthlyLoanPayment)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$LoanOriginalAmount and loan$MonthlyLoanPayment
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

Loan amount and monthly loan payment do have a very strong positive correlation. We will examine this relationship to detail in the next part.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Of course it makes a difference whether you are a homemaker or a judge - you will have a differnt income. Occupations were ranked by median income and the influence of part-time work was plotted. Monthly income and loan amount are related as well. Of course, you can afford a higher loan if you have a higher income.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The number of investors is generally increasing with a increasing loan amount.

What was the strongest relationship you found?

The monthly loan payment is strongly correlated to the Loan Amount.

Multivariate Plots Section

Let’s see how much you would have to pay back to Prosper per month dependent on the loan amount.

## `geom_smooth()` using method = 'gam'

This plot is quite interesting. The monthly rate the borrower has to pay is dependent on the loaned amount but there are three different relationships in the plot. Coloring the markers using Term helps revealing that the monthly loan payment is dependent on the loan amount and the term. There seems to be some more variables taken into account to determine the monthly loan payment as there is still some jitter. Maybe Prosper’s rating of the borrowers?

Let’s facet the plots by term to have a clearer look at the data.

Here it can be seen that for worse Prosper Ratings the monthly loan ratings will be higher for the same loan amount. High risk “HR” rated borrowers will have to pay the highest monthly loan payment. However, it seems that for the 12 months terms this rule seems not to be so riguous and that for the 36 months terms there seem to be quite a few exceptions of borrowers with a worse prosper rating who nonetheless seem to pay a lower monthly payment. Another fact to be drawn from this plot is that the better the Prosper Rating the higher the maximum loan amount. You would have to get a Propser Rating of AA, A or B to get a 35000 USD loan, C for a 25000 USD loan and with a Prosper Rating of D, E or HR the maximum loan amount is 15000 USD.

Let’s focus now on the question whether investors will get their money back.

Here we can see that for each loan grouped by Prosper Rating defaulted loans were paid back by borrowers with a lower mean income. Interestingly, for ratings better than C current loans are taken by higher incomes than the completed ones, in contrast to borrowers rated D or worse where it is the other way round. When do Borrowers take on too much? What is a good Debt to Income Ratio?

Let’s have a look at the Debt to Income Ration.

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 7296 rows containing missing values (geom_point).

Here it is interesting, that the Prosper Rating worked quite well, as from AA, A and B rated Borrowers only a few defaulted to this point. But apart from that, it is quite difficult to make out any difference an these plot. It seems the Debt/income ration is has no high impact on the Prosper Rating. Thus, forecasting a defaulting loan is a quite hard task.

Here we can compare different Listing Categories with each other where defaulting does happen more often: Green Loans do have a high defaulting rate, as well as Baby&Adoption, Medical/Dental and Household Expenses (and Not Available). As a Investor you should fund a loan for Recreating Vehicle, Motorcycle, Boat or Engagement Ring, these seem to be the safest options.

What occupations can you trust in as an investor?

Occupation with a high default ratio are Realtor and Nurse Aide, but also Food Service, Laborer and Construction and Student College Sophomore, Student Community College and Technical School.

What about the ratings for the occupations?

Interestingly, nevertheless Technical School Students were among the occuptions with a higher default ratio in the plot above, here they are extraordinarily good rated with a high frequency in A and B. HR ratings are common among College Junior Students and College Sophomore Students. But also Teachers Aides and Military Enlisted are not rated well by Prosper.

Finally, let’s check on the number of loans people get on Prosper. Are there people getting multiple loans? So let’s explore their “story”.

First, we see here that before 2008 loans had no category, so for older loans the category will be not available. So here are the Prosper “stories”: * A computer programmer with increasing income, who keeps on loaning for debt consolidation which sound like he is in a vicious circle. * A person of unknown profession who loans small amounts for student use and then debt consolidation and finally home improvement. * A Administrative Assistant, who took the forth loan for debt consolidation and all loans after in the category “other”, thus we do not get a lot information here. * A person of unknown professionwith slightly increasing salaray taking loans for debt consolidation then for business use and afterwards even larger amounts for debt consolidation. * A construction worker with at first no income asking for loans for business and home improvement and finally for debt consolidation - in the end he hasa quite high income but still needs high amounts of money for debt consolidation. * A Teacher starting taking loans for business use, other and a larger amount for home improvement and then varying amounts for debt consolidation. * A Professional with quite high income who takes mainly loans for debt consolidation but also for buying an Auto * A Nurse who needs large amounts to improve her home and some debt consolidation after a decreasement in her salary. * A borrower of unknown profession who takes loans for debt consolidation and home improvement until 2011 but then needs loans for medical treatments (and other). * A professional with a quite high income who solely takes loans for improving the house.

Here we could go on exploring more about debt consolidation. Is a credit of this category rated worse? But I have to stop this EDA at some point…

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

With a low Prosper Rating the monthly loan ratings will be higher for the same loan.

Were there any interesting or surprising interactions between features?

Technical School Students were among the occuptions with a higher default ratio but are generally extraordinarily good rated with a high frequency in A and B. I was also surprised that there are so many differnces between students in rating.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

I like this plot as it shows such a clear relationship of loan amount, monthly payment and prosper rating. The more you loaned the higher is your monthly payment. If you decided for the shorter of the two possibles terms (36 and 60 months) you would have to pay more per month and if you are in a lower prosper rating like even HR for high risk you would have to pay more per month.

Plot Two

Description Two

I chose this plot as occupation, monthly income and other features of the borrower might contribute to the Prosper rating, but the Listing Category could be taken into account as well.

Plot Three

Description Three

I chose this one as third plot as it shows the evolution of the Prosper platform with the years pretty well. How they introduced Loan purpose categories, how they expanded the categories. How they closed down the platform and relaunched it. How the manage more and more loans with the years. How it evolved to be a popular adress for debt consolidation.


Reflection

I went through a few problems when exploring the dataset. The first was the tremendous amount of variables and to understand what they all could mean as I am by no means familiar with the loan topic. I found quite hard that there are a lot categorical variables or discrete variables which does not leave you a lot to plot but barplots. What went well it that I got quite familiar with R in the end and got confidence that i can find answers to any question I might have on the web. In the end the time was short, and there is so much more to explore. I would have liked to polish up the “Story”plot but went into quite a few problems in the end which took me too much time to solve. I would like to do more plots on the prosper rating and how they came up with it, I guess it is quite complicated to reconstruct it. I would have been interested in dividing up the data before and after the relaunch and look for differences there: how did the rating evolve, did they better with putting “risk” borrowers into the HR rating? I would have done more plots with a timeline as x-axis to see the evolution of the platform. I got drawn away for a while by the income variable and spent a lot of time on that. I did not find THE variables which could give me a detailed insight into a borrower so I could train a model, which I would have liked to do R.